The transition to digital learning has made available new sources of data, providing researchers with new opportunities for understanding and improving STEM learning. Data sources such as digital learning environments and administrative data systems, as well as data produced by social media websites and the mass digitization of academic and practitioner publications, hold enormous potential to address a range of pressing problems in STEM education, but collecting and analyzing text-based data also presents unique challenges.
Text Mining (TM) Module 1: Public Sentiment and the State Standards will help demonstrate how text mining can be applied in STEM education research and provide LASER Institute scholars hands-on experience with popular techniques for collecting, processing, and analyzing text-based data. Specifically, the four learning labs that make up this module address the following topics:
Learning Lab 1: Tidy Text, Tokens, & Twitter. We take a closer look at the literature guiding our analysis; wrangle our data into a one-token-per-row tidy text format; and use simple word counts to explore our tweets about the common core and next generation science standards.
Learning Lab 2: Twice the Fun with Bigrams. For our second lab, we explore our unigrams, or single-word tokens, a little more, and also see what pairs of words and word correlations tell us about our tweets and what insight they provide in response to our research questions.
Learning Lab 3: Come to the Dark Side. We focus on the use of lexicons in our third lab and introduce the {vader} package to compare the sentiment of tweets about the NGSS and CCSS state standards in order to better understand public reaction to these two curriculum reform efforts.
Learning Lab 4: A Tale of Two Standards. We wrap our look at public sentiment around STEM state curriculum standards by selecting an analysis that provides some unique insight; refining and polishing a data product; and writing a brief narrative to communicate findings in response to our research questions.
Text Mining Module 1 is guided by a recent publication by Rosenberg et al. (2020), Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards. This study in turn builds upon previous work by Wang & Fikis (2017) examining public opinion on the Common Core State Standards (CCSS) on Twitter. For Module 1, we will focus on analyzing tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand key words and phrases that emerge, as well as public sentiment towards these two curriculum reform efforts.
While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of the views of multiple stakeholders toward them. To establish how public sentiment about this reform might be similar to or different from past efforts, we applied a suite of data science techniques to posts about the standards on Twitter from 2010-2020 (N = 571,378) from 87,719 users. Applying data science techniques to identify teachers and to estimate tweet sentiment, we found that the public sentiment towards the NGSS is overwhelmingly positive – 33 times more so than for the CCSS. Mixed effects models indicated that sentiment became more positive over time and that teachers, in particular, showed a more positive sentiment towards the NGSS. We discuss implications for educational reform efforts and the use of data science methods for understanding their implementation.
Similar to what we’ll be learning in this lab, Rosenberg et al. used publicly accessible data from Twitter collected using the Full-Archive Twitter API and the rtweet package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss,” “next generation science standard/s,” “next gen science standard/s.”
Data used in this lab was pulled using an Academic Research developer account and the {academictwitteR} package, which uses the Twitter API v2 endpoints and allows researchers to access the full Twitter archive. For those who created a standard developer account, the rtweet & the Twitter API supplemental learning lab will show you how to pull your own data from Twitter.
Data for this lab includes all tweets from 2021 that included the following terms: #ccss, common core, #ngsschat, ngss. Below is an example of the code used to retrieve data for this lab. The chunk is set not to evaluate and will not run if you try, but it illustrates the search query used, the variables selected, and the time frame.
ccss_tweets_2021 <-
  get_all_tweets('(#commoncore OR "common core") -is:retweet lang:en',
                 "2021-01-01T00:00:00Z",
                 "2021-05-31T00:00:00Z",
                 bearer_token,
                 data_path = "ccss-data/",
                 bind_tweets = FALSE)

ccss_tweets <- bind_tweet_jsons(data_path = "ccss-data/") %>%
  select(text,
         created_at,
         author_id,
         id,
         conversation_id,
         source,
         possibly_sensitive,
         in_reply_to_user_id)

write_csv(ccss_tweets, "data/ccss-tweets.csv")
Also similar to what we’ll demonstrate in Lab 3, the authors determined tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2019) used it to explore the sentiment of CCSS-related posts. We’ll be using the AFINN sentiment lexicon, which assigns each word a single integer score ranging from -5 (most negative) to +5 (most positive), in addition to exploring some other sentiment lexicons to see if they produce similar results.
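To give a feel for how a lexicon-based approach works before we get to Lab 3, here is a minimal sketch using a tiny made-up AFINN-style lexicon. The words, scores, and toy tweets below are invented for illustration; the real lexicon is much larger and can be loaded with tidytext::get_sentiments("afinn").

```r
library(dplyr)
library(tibble)

# A tiny made-up AFINN-style lexicon: each word gets one score from -5 to +5
toy_lexicon <- tibble(
  word  = c("love", "great", "awful", "confusing"),
  value = c(3, 3, -3, -2)
)

# Two toy "tweets" already tokenized into a one-word-per-row format
toy_tokens <- tibble(
  id   = c(1, 1, 1, 2, 2, 2),
  word = c("love", "the", "standards", "awful", "confusing", "math")
)

# Keep only the words that appear in the lexicon, then sum scores per tweet
toy_sentiment <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(id) %>%
  summarise(sentiment = sum(value))

toy_sentiment
# tweet 1 scores 3 (positive); tweet 2 scores -5 (negative)
```

The same inner_join() pattern works with any one-word-per-row data frame, which is exactly the tidy text format we build in this lab.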
The authors also used the lme4 package in R to run a mixed effects model to determine if sentiment changes over time and differs between teachers and non-teachers. We won’t look at the relationships between tweet sentiment, time, and teachers in these labs, but we will take a look at the correlation between words within tweets in TM Learning Lab 2.
Summary of Key Findings
Finally, you can watch Dr. Rosenberg provide a quick 3-minute overview of this work at <https://stanford.app.box.com/s/i5ixkj2b8dyy8q5j9o5ww4nafznb497x>
One overarching question that Silge and Robinson (2017) identify as central to text mining and natural language processing, and that we’ll explore throughout the text mining labs this year, is:
How do we quantify what a document or collection of documents is about?
The questions guiding the Rosenberg et al. study attempt to quantify public sentiment around the NGSS and how that sentiment changes over time. Specifically, they asked:
For our first lab on text mining in STEM education, we’ll use approaches similar to those used by the authors cited above to better understand public discourse surrounding these standards, particularly as they relate to STEM education. We will also try to gauge public sentiment around the NGSS by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets. Specifically, in the next four learning labs we’ll attempt to answer the following questions:
As noted in our Getting Started activity, R uses “packages,” add-ons that enhance its functionality. One package that we’ll be using extensively is {tidyverse}. The {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures. These shared features are sometimes referred to as “tidy data principles.”
Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Again, don’t worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in the packages you loaded have the same names as functions in other packages, and R will default to the function from the most recently loaded package unless you specify otherwise.
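For example, filter() exists in both {dplyr} and {stats}. After loading the tidyverse, a bare call to filter() uses the {dplyr} version, but the package::function syntax lets you pick a specific one at any time (the data frame below is just a throwaway example):

```r
library(dplyr)

df <- data.frame(x = 1:5)

# A bare call uses dplyr::filter() because {dplyr} was loaded last
filter(df, x > 3)

# The package::function syntax is always unambiguous
dplyr::filter(df, x > 3)  # same result as above

# stats::filter() is a different function entirely (for time series),
# so being explicit avoids surprising errors
```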
As we’ll learn firsthand in this module, using tidy data principles can also make many text mining tasks easier, more effective, and consistent with tools already in wide use. The {tidytext} package helps convert text into data frames of individual words, making it easy to manipulate, summarize, and visualize text using familiar functions from the {tidyverse} collection of packages.
Let’s go ahead and load the {tidytext} package:
library(tidytext)
For a more comprehensive introduction to the tidytext package, we cannot recommend enough the free online book, Text Mining with R: A Tidy Approach (Silge & Robinson, 2017). If you’re interested in pursuing text analysis using R after the Summer Workshop, this will be a go-to reference.
The importance of data wrangling, particularly when working with text, is difficult to overstate. Just as a refresher, wrangling involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). Learning Lab 2 will have a heavy emphasis on preparing text for analysis and in particular we’ll learn how to:
use the read_csv() function for reading in our CCSS and NGSS tweets
use the select() and filter() functions from {dplyr}, and introduce two new functions for merging the data frames that we imported

ccss_tweets <- read_csv("data/ccss-tweets.csv",
                        col_types = cols(author_id = col_character(),
                                         id = col_character(),
                                         conversation_id = col_character(),
                                         in_reply_to_user_id = col_character()))
Note the addition of the col_types = argument for changing some of the column types to character strings because the numbers for those particular columns actually indicate identifiers for authors and tweets:
author_id = the author of the tweet
id = the unique id for each tweet
conversation_id = the unique id for each conversation thread
in_reply_to_user_id = the author of the tweet being replied to
RStudio Tip: Importing data and dealing with data types can be a bit tricky, especially for beginners. Fortunately, RStudio has an “Import Dataset” feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process.
Try using the “Import Dataset” feature in the upper right environment pane to import the NGSS tweets located in the data folder.
The code generated should look something like this:
ngss_tweets <- read_csv("data/ngss-tweets.csv",
                        col_types = cols(author_id = col_character(),
                                         id = col_character(),
                                         conversation_id = col_character(),
                                         in_reply_to_user_id = col_character()))
Use the following code chunk to inspect your tweets using a function you’ve learned so far for viewing your data:
# your code here
As you may have noticed, we have more data than we need for our analysis and should probably pare it down to just what we’ll use.
First, since this is a family-friendly learning lab, let’s use the filter() function introduced in previous labs to filter out rows containing “possibly sensitive” language:
ccss_tweets_1 <- ccss_tweets %>%
filter(possibly_sensitive == "FALSE")
Now let’s use the select() function to select the following columns from our new ccss_tweets_1 data frame:
text containing the tweet, which is our primary data source of interest
author_id of the user who created the tweet
created_at timestamp for examining changes in sentiment over time
conversation_id for examining sentiment by conversation
id for the unique reference id for each tweet, useful for counts

ccss_tweets_2 <- ccss_tweets_1 %>%
  select(text,
         author_id,
         created_at,
         conversation_id,
         id)
Note: The select() function will also reorder your columns based on the order in which you list them.
Use the code chunk below to reorder the columns to your liking and assign to ccss_tweets_3:
# your code here
Finally, since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful to have a column that quickly identifies the set of state standards with which each tweet is associated.
We’ll use the mutate() function to create a new variable called standards to label each tweet as “ccss”:
ccss_tweets_4 <- mutate(ccss_tweets_2, standards = "ccss")
colnames(ccss_tweets_4)
## [1] "text" "author_id" "created_at" "conversation_id"
## [5] "id" "standards"
And just because it bothers me, I’m going to use the relocate() function to move the standards column to the first position so I can quickly see which standards the tweet is from:
ccss_tweets_5 <- relocate(ccss_tweets_4, standards)
colnames(ccss_tweets_5)
## [1] "standards" "text" "author_id" "created_at"
## [5] "conversation_id" "id"
Again, we could also have used the select() function to reorder columns like so:
ccss_tweets_5 <- ccss_tweets_4 %>%
select(standards,
text,
author_id,
created_at,
conversation_id,
id)
colnames(ccss_tweets_5)
## [1] "standards" "text" "author_id" "created_at"
## [5] "conversation_id" "id"
Before moving on to the NGSS tweets, let’s use the %>% operator to rewrite the code from our wrangling so there is less redundancy and it is easier to read:
# Search Tweets
ccss_tweets_clean <- ccss_tweets %>%
filter(possibly_sensitive == "FALSE") %>%
select(text, author_id, created_at, conversation_id, id) %>%
mutate(standards = "ccss") %>%
relocate(standards)
head(ccss_tweets_clean)
## # A tibble: 6 x 6
## standards text author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 ccss "@catturd2 H… 1609854356 2021-01-02 00:49:28 13451697062071… 13451…
## 2 ccss "@homebrew15… 1249594897… 2021-01-02 00:40:05 13451533915976… 13451…
## 3 ccss "@ClayTravis… 8877070540… 2021-01-02 00:32:46 13450258639942… 13451…
## 4 ccss "@KarenGunby… 1249594897… 2021-01-02 00:24:01 13451533915976… 13451…
## 5 ccss "@keith3048 … 1252747591 2021-01-02 00:23:42 13451533915976… 13451…
## 6 ccss "Probably co… 1276017320… 2021-01-02 00:18:38 13451625486818… 13451…
Recall from section 1b. Define Questions that we are interested in comparing word usage and public sentiment around both the Common Core and Next Gen Science Standards.
Create a new ngss_tweets_clean data frame consisting of the Next Generation Science Standards tweets we imported, using the code above as a guide.
# your code here
Try not to peek at the answer below unless you are having difficulty with your code.
ngss_tweets_clean <- ngss_tweets %>%
filter(possibly_sensitive == "FALSE") %>%
select(text, author_id, created_at, conversation_id, id) %>%
mutate(standards = "ngss") %>%
relocate(standards)
head(ngss_tweets_clean)
## # A tibble: 6 x 6
## standards text author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 ngss "Please help… 3279907796 2021-01-06 00:50:49 13466201998945… 13466…
## 2 ngss "What lab ma… 1010324664… 2021-01-06 00:45:32 13466188701325… 13466…
## 3 ngss "I recently … 61829645 2021-01-06 00:39:37 13466173820858… 13466…
## 4 ngss "I'm thrille… 461653415 2021-01-06 00:30:13 13466150172071… 13466…
## 5 ngss "PLS RT. Exc… 22293234 2021-01-06 00:15:05 13466112069671… 13466…
## 6 ngss "Inspired by… 3317960226 2021-01-06 00:00:00 13466074140999… 13466…
Finally, let’s combine our CCSS and NGSS tweets into a single data frame by using the union() function from dplyr and simply supplying the data frames that you want to combine as arguments:
ss_tweets <- union(ccss_tweets_clean,
ngss_tweets_clean)
Note that when creating a “union” like this (i.e. stacking one data frame on top of another), you should have the same number of columns in each data frame and they should be in the exact same order.
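A minimal sketch of that idea with two toy one-row data frames (the tibbles below are invented for illustration). Note that dplyr::bind_rows() is a common alternative that matches columns by name, so it is less sensitive to column order, though unlike union() it keeps exact duplicate rows:

```r
library(dplyr)
library(tibble)

a <- tibble(standards = "ccss", text = "a tweet about common core")
b <- tibble(standards = "ngss", text = "a tweet about science standards")

# union() stacks the rows and drops exact duplicates;
# both inputs must have the same columns in the same order
combined <- union(a, b)

# bind_rows() matches columns by name instead of position
combined_2 <- bind_rows(a, b)
```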
Finally, let’s take a quick look at both the head() and the tail() of this new ss_tweets data frame to make sure it contains both “ngss” and “ccss” standards:
head(ss_tweets)
## # A tibble: 6 x 6
## standards text author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 ccss "@catturd2 H… 1609854356 2021-01-02 00:49:28 13451697062071… 13451…
## 2 ccss "@homebrew15… 1249594897… 2021-01-02 00:40:05 13451533915976… 13451…
## 3 ccss "@ClayTravis… 8877070540… 2021-01-02 00:32:46 13450258639942… 13451…
## 4 ccss "@KarenGunby… 1249594897… 2021-01-02 00:24:01 13451533915976… 13451…
## 5 ccss "@keith3048 … 1252747591 2021-01-02 00:23:42 13451533915976… 13451…
## 6 ccss "Probably co… 1276017320… 2021-01-02 00:18:38 13451625486818… 13451…
tail(ss_tweets)
## # A tibble: 6 x 6
## standards text author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 ngss @BK3DSci Bria… 558971700 2021-05-21 01:10:28 13955471161272… 13955…
## 2 ngss A1 My studen… 1449382200 2021-05-21 01:10:20 13955474728990… 13955…
## 3 ngss A1: It is an … 136014942 2021-05-21 01:09:58 13955473807585… 13955…
## 4 ngss @MsB_Reilly M… 3164721571 2021-05-21 01:09:54 13955471085775… 13955…
## 5 ngss A1.5 I also l… 14449947 2021-05-21 01:09:46 13955473306029… 13955…
## 6 ngss @MsB_Reilly W… 558971700 2021-05-21 01:09:44 13955471085775… 13955…
Wow, so much for a family friendly learning lab! Based on this very limited sample, which set of standards do you think Twitter users are more negative about?
Let’s take a slightly larger sample of the CCSS tweets:
ss_tweets %>%
filter(standards == "ccss") %>%
sample_n(20) %>%
relocate(text)
## # A tibble: 20 x 6
## text standards author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 "Congression… ccss 859900296 2021-02-08 15:57:10 13588070908155… 13588…
## 2 "@TPCarney S… ccss 262511590 2021-01-03 21:54:47 13458451342069… 13458…
## 3 "@miles_comm… ccss 21127293 2021-03-23 00:22:36 13741542180896… 13741…
## 4 "@KSekouM Wo… ccss 959119149… 2021-01-12 22:05:36 13490927676669… 13491…
## 5 "@SaunderHar… ccss 17705724 2021-05-30 17:21:08 13989422591235… 13990…
## 6 "@Misty4PHco… ccss 126663434… 2021-01-27 05:16:00 13542029417379… 13542…
## 7 "@RubinRepor… ccss 131210778… 2021-03-26 20:50:40 13755502079880… 13755…
## 8 "[Read] Mobi… ccss 135767214… 2021-02-14 10:33:17 13608999119280… 13608…
## 9 "Easy, it's … ccss 100983564… 2021-03-25 20:50:58 13751884828937… 13751…
## 10 "The FACT re… ccss 100293870… 2021-02-26 18:10:44 13653636851375… 13653…
## 11 "@IM_Communi… ccss 121818485… 2021-02-25 14:25:18 13649248212603… 13649…
## 12 "@SergioVeng… ccss 748990700… 2021-03-31 03:51:22 13770642420845… 13771…
## 13 "DEFCON 3 [M… ccss 717455332… 2021-02-01 23:29:04 13563840993348… 13563…
## 14 "@Lisak52 @D… ccss 117216475… 2021-03-23 00:22:44 13741520282519… 13741…
## 15 "@disneyglim… ccss 123349714… 2021-04-16 14:10:49 13830496164180… 13830…
## 16 "@SpotterBre… ccss 62493830 2021-05-07 01:28:57 13904727678757… 13904…
## 17 "@thomaschat… ccss 113813211… 2021-05-25 23:19:07 13972391677179… 13973…
## 18 "@Bored_Teac… ccss 133124699… 2021-03-02 17:26:35 13667653477103… 13668…
## 19 "Read it on … ccss 22384582 2021-05-22 08:10:07 13960155030686… 13960…
## 20 "@swampymag … ccss 120168834… 2021-01-17 14:33:48 13506138151514… 13508…
Use the code chunk below to take a sample of the NGSS tweets:
# your code here
Still of the same opinion?
Text data by its very nature is ESPECIALLY untidy and is sometimes referred to as “unstructured” data. In this section we are introduced to the {tidytext} package and will learn some new functions to convert text to and from tidy formats. Having our text in a tidy format will allow us to switch seamlessly between tidy tools and existing text mining packages, while also making it easier to visualize text summaries in other data analysis tools like Tableau.
In Chapter 1 of Text Mining with R, Silge & Robinson (2017) define the tidy text format as a table with one-token-per-row, and explain that:
A token is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.
This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.
For this part of our workflow, our goal is to transform our ss_tweets data from this:
head(relocate(ss_tweets, text))
## # A tibble: 6 x 6
## text standards author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 "@catturd2 H… ccss 1609854356 2021-01-02 00:49:28 13451697062071… 13451…
## 2 "@homebrew15… ccss 1249594897… 2021-01-02 00:40:05 13451533915976… 13451…
## 3 "@ClayTravis… ccss 8877070540… 2021-01-02 00:32:46 13450258639942… 13451…
## 4 "@KarenGunby… ccss 1249594897… 2021-01-02 00:24:01 13451533915976… 13451…
## 5 "@keith3048 … ccss 1252747591 2021-01-02 00:23:42 13451533915976… 13451…
## 6 "Probably co… ccss 1276017320… 2021-01-02 00:18:38 13451625486818… 13451…
Into a “tidy text” one-token-per-row format that looks like this:
tidy_tweets <- ss_tweets %>%
unnest_tokens(output = word,
input = text) %>%
relocate(word)
head(tidy_tweets)
## # A tibble: 6 x 6
## word standards author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 cattur… ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 2 hmmmm ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 3 common ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 4 core ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 5 math ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 6 now ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
Later in the year, we’ll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.
As demonstrated above, the tidytext package provides the incredibly powerful unnest_tokens() function to tokenize text (including tweets!) and convert them to a one-token-per-row format.
Let’s tokenize our tweets by using this function to split each tweet into one word per row, making it easier to analyze, and take a look:
ss_tokens <- unnest_tokens(ss_tweets,
output = word,
input = text)
head(relocate(ss_tokens, word))
## # A tibble: 6 x 6
## word standards author_id created_at conversation_id id
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 cattur… ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 2 hmmmm ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 3 common ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 4 core ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 5 math ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
## 6 now ccss 16098543… 2021-01-02 00:49:28 1345169706207109… 13451703111…
There is A LOT to unpack with this function:
unnest_tokens() expects a data frame as the first argument, followed by two column names: the name of the output column to be created (word in this case) and the name of the input column that the text comes from (text).
Other columns, such as author_id and created_at, are retained.
Punctuation is stripped and tokens are converted to lowercase by default (there is a to_lower = FALSE argument to turn this off if desired).
Note: Since {tidytext} follows tidy data principles, we also could have used the %>% operator to pass our data frame to the unnest_tokens() function like so:
ss_tokens <- ss_tweets %>%
unnest_tokens(output = word,
input = text)
The unnest_tokens() function also has a specialized “tweets” tokenizer in the token = argument that is very useful for dealing with Twitter text. It retains hashtags and mentions of usernames with the @ symbol, as illustrated by our @catturd2 friend who featured prominently in the first CCSS tweet.
Rewrite the code above (you can check the answer below) to include the token argument set to “tweets”, assign the result to ss_tokens_1, and answer the questions that follow:
# your code here
How many observations were in our original ss_tweets data frame?
How many observations are there now? Why the difference?
Your code should look something like this:
ss_tokens_1 <- unnest_tokens(ss_tweets,
output = word,
input = text,
token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
head(ss_tokens_1)
## # A tibble: 6 x 6
## standards author_id created_at conversation_id id word
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… @catt…
## 2 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… hmmmm
## 3 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… common
## 4 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… core
## 5 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… math
## 6 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… now
Before we move any further, let’s take a quick look at the most common words in our two datasets:
ss_tokens_1 %>%
count(word, sort = TRUE)
## # A tibble: 74,235 x 2
## word n
## <chr> <int>
## 1 common 26665
## 2 core 26470
## 3 the 25818
## 4 to 20478
## 5 and 15552
## 6 of 13106
## 7 a 12472
## 8 math 11788
## 9 is 11562
## 10 in 10076
## # … with 74,225 more rows
Well, many of these tweets are clearly about the CCSS and math at least, but beyond that it’s a bit hard to tell because there are so many “stop words” like “the,” “to,” “and,” and “in” that don’t carry much meaning by themselves.
Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The stop_words dataset in the {tidytext} package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
Let’s take a closer look at the lexicons and the stop words included in each:
View(stop_words)
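If one lexicon better fits a particular analysis, we can filter() the stop_words data frame down to just that set. A quick sketch (exact word counts vary with the installed {tidytext} version, so none are hard-coded here):

```r
library(dplyr)
library(tidytext)

# See how many stop words each lexicon contributes
stop_words %>%
  count(lexicon)

# Keep only the "snowball" lexicon, a relatively small, conservative set
snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")
```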
The anti_join() Function
In order to remove these stop words, we will use a function called anti_join() that looks for matching values in a specific column from two datasets and returns rows from the original dataset that have no matches.
For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119.
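To see what anti_join() does at a small scale, here is a toy example with made-up data; only the rows of tokens whose word has no match in stops survive:

```r
library(dplyr)
library(tibble)

tokens <- tibble(word = c("the", "common", "core", "and", "math"))
stops  <- tibble(word = c("the", "and", "of"))

# anti_join() returns the rows of `tokens` with no matching `word` in `stops`
kept <- anti_join(tokens, stops, by = "word")
kept
# word: common, core, math
```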
Now let’s remove stop words that don’t help us learn much about what people are saying about the state standards.
ss_tokens_2 <- anti_join(ss_tokens_1,
stop_words,
by = "word")
head(ss_tokens_2)
## # A tibble: 6 x 6
## standards author_id created_at conversation_id id word
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… @catt…
## 2 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… hmmmm
## 3 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… common
## 4 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… core
## 5 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… math
## 6 ccss 1609854356 2021-01-02 00:49:28 1345169706207109… 13451703111… makes
Notice that we’ve specified the by = argument to look for matching words in the word column of both data sets and remove any rows from the ss_tokens_1 dataset that match the stop_words dataset. Remember that when we first tokenized our dataset I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the {tidytext} package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. However, this wasn’t strictly necessary, since word is the only matching column name in both datasets and anti_join() would have matched on it by default.
Use the code chunk below to take a quick count of the most common tokens in our ss_tokens_2 data frame to see if the results are a little more meaningful:
ss_tokens_2 %>%
count(word, sort = TRUE)
## # A tibble: 73,596 x 2
## word n
## <chr> <int>
## 1 common 26665
## 2 core 26470
## 3 math 11788
## 4 #ngsschat 3059
## 5 amp 2904
## 6 #ngss 2655
## 7 students 2559
## 8 science 2300
## 9 standards 2273
## 10 education 2174
## # … with 73,586 more rows
Notice that the nonsense word “amp” (a leftover from the HTML character entity &amp;) is among our high-frequency words, along with some stray symbols. We can create our own custom stop word list to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
Let’s create a custom stop word list using the simple c() function to combine our words. We can then add a filter to keep only the rows where words in our word column do NOT (!) match the words in our my_stopwords list:
my_stopwords <- c("amp", "=", "+")
ss_tokens_3 <-
ss_tokens_2 %>%
filter(!word %in% my_stopwords)
Let’s take a look at our top words again and see if that did the trick:
ss_tokens_3 %>%
count(word, sort = TRUE)
## # A tibble: 73,593 x 2
## word n
## <chr> <int>
## 1 common 26665
## 2 core 26470
## 3 math 11788
## 4 #ngsschat 3059
## 5 #ngss 2655
## 6 students 2559
## 7 science 2300
## 8 standards 2273
## 9 education 2174
## 10 school 2154
## # … with 73,583 more rows
Much better! Note that we could extend this stop word list indefinitely. Feel free to use the code chunk below to try adding more words to our stop list.
# your code here
Calculating summary statistics, visualizing data, and engineering features (the process of creating new variables from a dataset) are key parts of exploratory data analysis. In Section 3, we keep things relatively simple and focus on some simple data summaries:
Word Counts. We focus primarily on the use of word counts and briefly introduce word frequencies to help us identify words commonly used in tweets about the NGSS and CCSS curriculum standards.
Word Frequencies. We wrap up this lab, and preview some data visualization work in later labs, by creating a simple word cloud to explore, summarize, and highlight key words among our tweets.
As highlighted in Word Counts are Amazing, an excellent blog post by Ted Underwood at the University of Illinois, one simple but powerful approach to text analysis is counting the frequency with which words occur in a given collection of documents, or corpus.
Word counts are a good example of a simple approach that illustrates the central question of text mining and natural language processing introduced at the beginning:
How do we quantify what a document or collection of documents is about?
So far, we’ve used the count() function from the {dplyr} package to look at word counts across our entire corpus of tweets.
Let’s use the same function to look at counts of the most common words by standards this time since one of our goals is to compare public sentiment between the two standards:
ss_tokens_3 %>%
count(standards, word, sort = TRUE)
## # A tibble: 81,133 x 3
## standards word n
## <chr> <chr> <int>
## 1 ccss common 26599
## 2 ccss core 26405
## 3 ccss math 11688
## 4 ngss #ngsschat 3058
## 5 ngss #ngss 2646
## 6 ngss science 1950
## 7 ccss education 1917
## 8 ccss kids 1821
## 9 ccss standards 1810
## 10 ccss school 1806
## # … with 81,123 more rows
Note that we included standards in our function to count how often each word occurs for each set of standards. For example, if you tab through the output, you will see that “students” is among the top words in both sets of standards, and occurs 1,432 times in the NGSS tweets and 1,127 times in the CCSS tweets.
Unsurprisingly, words from our Twitter API search query are among the top words in each set of standards as well.
It’s a little difficult to directly compare the top words in each set since they are lumped together. Let’s use our filter() function again to look at just the CCSS tweets, and save the result for later use in our Reach activity:
ccss_counts <- ss_tokens_3 %>%
filter(standards == "ccss") %>%
count(word, sort = TRUE)
ccss_counts
## # A tibble: 57,435 x 2
## word n
## <chr> <int>
## 1 common 26599
## 2 core 26405
## 3 math 11688
## 4 education 1917
## 5 kids 1821
## 6 standards 1810
## 7 school 1806
## 8 dont 1622
## 9 grade 1443
## 10 people 1410
## # … with 57,425 more rows
Now use the code below to get the counts for our NGSS tweets so we can compare the top words for each set of standards:
# your code here
What might the top words for each set of standards suggest about similarities and differences in how Twitter users talk about each? What might they suggest about public sentiment?
We saw above that the word “students” is among the top words in both sets of standards, but to facilitate comparisons, it is often helpful to look at the frequency with which each word occurs among all words for that group. This also helps us better gauge how prominent each word is within each set of standards.
For example, let’s take the CCSS word counts we created above and use the mutate() function to add a new column that calculates the proportion each word makes up among all words:
ccss_frequencies <- ccss_counts %>%
mutate(proportion = n / sum(n))
ccss_frequencies
## # A tibble: 57,435 x 3
## word n proportion
## <chr> <int> <dbl>
## 1 common 26599 0.0773
## 2 core 26405 0.0767
## 3 math 11688 0.0339
## 4 education 1917 0.00557
## 5 kids 1821 0.00529
## 6 standards 1810 0.00526
## 7 school 1806 0.00525
## 8 dont 1622 0.00471
## 9 grade 1443 0.00419
## 10 people 1410 0.00410
## # … with 57,425 more rows
Now use the code below to get the frequencies for our NGSS tweets so we can compare the top words for each set of standards:
# your code here
We can see in both cases that our search terms are heavily skewing our proportions. What might we do to address this?
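One common fix, which you will get a chance to try in the Reach activity below, is to treat our search terms as custom stop words and remove them before counting. A minimal sketch using anti_join() (exactly which terms to drop is a judgment call, and the list here is just illustrative):

```r
# A small custom stop word list built from our search terms
custom_stop_words <- tibble(
  word = c("common", "core", "ngss", "#ngss", "#ngsschat")
)

# Remove those terms before counting, so they no longer dominate the proportions
ss_tokens_3 %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(standards, word, sort = TRUE)
```

With the search terms removed, the remaining proportions give a more honest picture of what each conversation is actually about.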
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us identify patterns and relationships in our data, statistical models can help us determine whether those relationships, patterns, and trends are actually meaningful. In TM Learning Lab 3, we’ll take a closer look at the study by Rosenberg et al. (2020) and how they used modeling to compare differences in sentiment between teachers and non-teachers discussing the common core.
Congratulations - you’ve completed the first text mining learning lab! To complete your work, you can click the drop-down arrow at the top of the file, then select “Knit to HTML.” This will create a report in your Files pane that serves as a record of your code and its output that you can open or share.
If you wanted, you could save the processed data set to your data folder. The write_csv() function is useful for this. The following code is set not to run, as we wanted to ensure that everyone had the data set needed to begin the second learning lab, but if you’re confident in your prepared data, you can save it with something like the following:
write_csv(ss_tokens_3, "data/ss_tokens_3.csv") # file name and path are just a suggestion
If you’re using data that you brought to the institute or data that you pulled from Twitter, try wrangling it into a tidy text format and examining the top words in your dataset.
If you’d like to use the data we’ve been working with for your Reach activity, let’s try some basic data visualization with text. The wordcloud2 package is a dead-simple tool for generating HTML-based word clouds.
For example, let’s load the wordcloud2 library and run the wordcloud2() function on our ccss_counts data frame:
library(wordcloud2)
wordcloud2(ccss_counts)
As you can see, “math” is a pretty common topic when discussing the common core on Twitter, but words like “core” and “common” are not very helpful since they were in our search terms when pulling data from Twitter.
In a separate R script file, try modifying our list of stop words, retidying our text, and doing a new word count with these and perhaps other words removed. Also, take a look at the help file for wordcloud2 to see if there are other ways you could visually improve this visualization.
Word clouds are much maligned and sometimes referred to as pie charts for words, but they can be useful for quickly summarizing and communicating qualitative data to education practitioners, who find them intuitive to interpret. Also, for better or worse, they are now included as a default visualization for open-ended survey items in online Qualtrics reports.
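If you do experiment with wordcloud2, a few of its arguments can noticeably improve on the defaults. A quick sketch (the argument values here are just starting points to play with):

```r
library(wordcloud2)

# Shrink the overall scale and switch to a light-on-dark color scheme
wordcloud2(ccss_counts,
           size = 0.5,
           color = "random-light",
           backgroundColor = "black")
```

Reducing size can help when a very frequent word (like “common” in our data) would otherwise be drawn too large to fit in the plotting area.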
In this learning lab, we focused on the literature guiding our analysis; wrangling our data into a one-token-per-row tidy text format; and using simple word counts and frequencies to compare common words used in tweets about the NGSS and CCSS curriculum standards. Below, add a few notes in response to the following prompts:
One thing I took away from this learning lab:
One thing I want to learn more about:
Note: citations embedded in R Markdown will only show upon knitting.